While the axiomatic introduction of a probability distribution over a space is
common, its use for making predictions, using physical theories and prior knowledge,
suffers from a lack of formalization. We propose to introduce, in the space of
all probability distributions, two operations, the OR and the AND operation, that
bring to the space the necessary structure for making inferences on possible values
of physical parameters. While physical theories are often assumed to be analytical,
we argue that consistent inference needs to replace analytical theories by probability
distributions over the parameter space, and we propose a systematic way of
obtaining such "theoretical correlations", using the OR operation on the results of
physical experiments. Predicting the outcome of an experiment or solving "inverse
problems" are then examples of the use of the AND operation. This leads to a simple
and complete mathematical basis for general physical inference.

1 Introduction

Why has mathematical physics become so universal? Does it allow the proper formulation 
of usual physical problems? Some reasons explain the popularity of mathematical physics. 
One reason is practical: mathematical physical theories may condense a huge number
of experimental results in a few functional relationships. Perhaps more importantly, these 
relationships usually have a tremendous power of extrapolation, allowing the prediction of 
the outcome of experiments never performed. Psychologically, this capacity of predicting 
the outcome of experiments gives the very satisfactory feeling of "understanding" . 

Today, most scientists accept Popper's [1] point of view that physics advances by postulating
mathematical relations between physical parameters. While it is fully recognized
that physical theories should be confronted with experiments, Popper emphasized that
successful predictions, in whatever number, can never prove that a theory is correct, but
that one single observation that contradicts the predictions of the theory is enough to refute,
to falsify, the whole theory. He also stressed that these contradictory results are of utmost
importance for the advance of physics. When Michelson and Morley [2] could not
find the predicted difference in the speed of light when the observer changes its velocity
relative to the source, they broke the ground for the replacement of classical by relativistic
mechanics.

Physical theories are conceptual models of reality. A good physical theory contains 
some or all of the following elements: 

i) a modelisation of the space-time (for instance, as a four-dimensional continuum, or 
as a fractal entity) 

ii) a modelisation of the objects of the "universe" (for instance, as point particles, or 
as continuous media) 

iii) a recognition of the significant parameters in the experiments to be performed and 
a precise, operational, definition of these parameters 

iv) mathematical relations postulated between these parameters, obtained by trying 
to obtain the best fit between observations and theoretical predictions. 

While no physics is possible without points (i-iii) above, point (iv), i.e., postulating
functional relations between the parameters to be used, is not a necessity.

Any physical knowledge is uncertain, and estimation of uncertainties is crucial, for
prosaic (e.g., preventing mechanical structures from collapsing) as well as ethereal (e.g., using
experimental results to decide between theories) reasons. The problem we face is that
while considerable effort is spent in estimating experimental uncertainties, once a theory
is postulated that is acceptable in view of these uncertainties, usual mathematical physics
reasons as if the theory were exact. For instance, while we can use Gravitation Theory to
predict the behaviour of space-time near the big-bang of certain models of the Universe,
we have no means of estimating how uncertain our predictions are.

This is more striking when using analytical theories to solve the so-called "inverse 
problems" (^|-§), where data, a priori information, and "physical theories" have to be 
used to make inferences about some parameters. The consideration of exact theories 
leads, at best, to inaccurate estimations, at worst, to mathematical inconsistencies (|^. 

This paper proposes an alternative to the common practice of postulating functional
relationships between physical parameters. In fact, we propose a mathematical formalization
of pure empiricism, as opposed to mathematical rationalism. Essentially, we suggest
replacing functional relationships between physical parameters by well defined probability
distributions over the parameter space.

The proposed formalism will, in some sense, only be a sort of "tabulation" of sys- 
tematically performed experiments. In some aspects it will be less powerful than the one 
obtained through the use of analytical theories (it will not be able to extrapolate); in some 
aspects it will be more powerful (it will be able to properly handle actual uncertainties). 

We will show how experiments could be, at least in principle, systematically performed 
so that a probability distribution in the parameter space is obtained that contains, as an 
analytical theory, the observed correlations between physical parameters, but, in addition, 
contains the full description of the attached uncertainties. We will also explain how, 
once such a theoretical probability distribution has been obtained, we can use it to make 
predictions — that will have attached uncertainties — or to use data to solve general inverse 
problems. 

To fulfill this project we need to complete classical probability theory. Kolmogorov (^ 
proposed an axiomatic introduction of the notion of probability distribution over a space. 
The definition of conditional probability is then the starting point for doing inferences, 
as, for instance, through the use of the Bayes theorem. But the space of all probability 
distributions (over a given space) lacks structure. We argue below that there are two
natural operations, the OR and the AND operation, to be defined over the probability
distributions, that create the necessary structure: that of an inference space. We will
see that while the OR operation corresponds to an obvious generalization of "making
histograms" from observed results, the AND operation is just the right generalization of
the notion of conditional probability.

2 The structure of an Inference Space 

Before Kolmogorov (H), probability calculus was made using the intuitive notions of 
"chance" or "hazard" . Kolmogorov's axioms clarified the underlying mathematical struc- 
ture and brought probability calculus inside well defined mathematics. In this section we 
will recall these axioms. Our opinion is that the use in physical theories (where we have
invariance requirements) of probability distributions, through the notions of conditional
probability or the so-called Bayesian paradigm, suffers today from the same defects as
probability calculus suffered from before Kolmogorov. To remedy this, we introduce in
this section, in the space of all probability distributions, two logical operations (OR and
AND) that give the necessary mathematical structure to the space.

2.1 Kolmogorov's concept of probability 

A point x, that can materialize itself anywhere inside a domain $\mathcal{D}$, may be realized,
for instance, inside A, a subdomain of $\mathcal{D}$. The probability of realization of the point
is completely described if we have introduced a probability distribution (in Kolmogorov's
sense) on $\mathcal{D}$, i.e., if to every subdomain A of $\mathcal{D}$ we are able to associate a real
number P(A), called the probability of A, having the three properties:



• For any subdomain A of $\mathcal{D}$, $P(A) \geq 0$.

• If $A_i$ and $A_j$ are two disjoint subsets of $\mathcal{D}$, then $P(A_i \cup A_j) = P(A_i) + P(A_j)$.

• For a sequence of events $A_1 \supseteq A_2 \supseteq \cdots$ tending to the empty set, we have $P(A_i) \to 0$.

We will not necessarily assume that a probability distribution is normed to unity
( $P(\mathcal{D}) = 1$ ). Although one refers to this as a measure, instead of a probability, we will
not use this distinction. Sometimes, our probability distributions will not be normalizable
at all ( $P(\mathcal{D}) = \infty$ ). We can only then compute the relative probabilities of subdomains.

These axioms apply to probability distributions over discrete or continuous spaces. 
Below, we will consider probability distributions over spaces of physical parameters, that 
are continuous spaces. Then, a probability distribution is represented by a probability 
density (note |^ explains the difference between a probability density and a volumetric 
probability) . 

In the next section, given a space $\mathcal{D}$, we will consider different probability
distributions P, Q, ... Each probability distribution will represent a particular state of
information over $\mathcal{D}$. In what follows, we will use as synonymous the terms "probability
distribution" and "state of information".

2.2 Inference space 

We will now give a structure to the space of all the probability distributions over a given
space, by introducing two operations, the OR and the AND operation. This contrasts
with the basic operations introduced in deductive logic, where the negation ("not"),
nonexistent here, plays a central role. In what follows, the OR and the AND operation will
be denoted, symbolically, by $\vee$ and $\wedge$. They are assumed to satisfy the set of axioms
given below.

The first axiom states that if an event A is possible for (P OR Q), then the event
is either possible for P or possible for Q (which is consistent with the usual logical
sense for the "or"): For any subset A, and for any two probability distributions P and
Q, the OR operation satisfies

$$ (P \vee Q)(A) \neq 0 \;\Longrightarrow\; P(A) \neq 0 \ \text{or}\ Q(A) \neq 0 , $$

the word "or" having here its ordinary logical sense. 

The second axiom states that if an event A is possible for (P AND Q), then the
event is possible for both P and Q (which is consistent with the usual logical sense for
the "and"): For any subset A, and for any two probability distributions P and Q,
the AND operation satisfies

$$ (P \wedge Q)(A) \neq 0 \;\Longrightarrow\; P(A) \neq 0 \ \text{and}\ Q(A) \neq 0 , $$

the word "and" having here its ordinary logical sense. 

The third axiom ensures the existence of a neutral element, that will be interpreted
below as the probability distribution carrying no information at all: There is a neutral
element, M, for the AND operation, i.e., there exists an M such that for any probability
distribution P and for any subset A,

$$ (M \wedge P)(A) = (P \wedge M)(A) = P(A) . $$



The fourth axiom imposes that the OR and the AND operations are commutative and
associative, and, by analogy with the algebra of propositions of ordinary logic, have a
distributivity property: the AND operation is distributive with respect to the OR operation.

The structure obtained when furnishing the space of all probability distributions (over
a given space $\mathcal{D}$) with two operations OR and AND, satisfying the given axioms, constitutes
what we propose to call an inference space.

These axioms do not define uniquely the operations. Let $\mu(x)$ be the particular
probability density representing M, the neutral element for the AND operation, and
let p(x), q(x), ... be the probability densities representing the probability distributions
P, Q, ... Using the notations $(p \vee q)(x)$ and $(p \wedge q)(x)$ for the probability densities
representing the probability distributions $P \vee Q$ and $P \wedge Q$ respectively, one realization
of the axioms (the one we will retain) is given by

$$ (p \vee q)(x) = p(x) + q(x) ; \qquad (p \wedge q)(x) = \frac{p(x)\, q(x)}{\mu(x)} , \qquad (1) $$

where one should remember that we do not require our probability distributions to be
normalized.
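
As an illustration, the retained realization of the two operations can be sketched numerically on a discretized space. This is only a minimal sketch: the grid, the choice $\mu(x) = 1/x$ for the null information density, the three bell-shaped densities, and the function names `op_or` and `op_and` are assumptions made for the example. The checks at the end verify the neutral element, the commutativity, and the distributivity required by the axioms.

```python
import numpy as np

def op_or(p, q):
    """Union of two states of information: (p OR q)(x) = p(x) + q(x)."""
    return p + q

def op_and(p, q, mu):
    """Intersection: (p AND q)(x) = p(x) q(x) / mu(x), mu being the null information density."""
    return p * q / mu

# Hypothetical grid and (unnormalized) densities, chosen only for illustration.
x = np.linspace(0.1, 10.0, 500)
mu = 1.0 / x                                   # assumed null information density
p = np.exp(-0.5 * ((x - 3.0) / 0.8) ** 2)
q = np.exp(-0.5 * ((x - 5.0) / 1.5) ** 2)
r = np.exp(-0.5 * ((x - 4.0) / 0.5) ** 2)

# mu is neutral for AND.
neutral_ok = np.allclose(op_and(mu, p, mu), p)
# AND is commutative.
commutative_ok = np.allclose(op_and(p, q, mu), op_and(q, p, mu))
# AND is distributive with respect to OR.
distributive_ok = np.allclose(op_and(p, op_or(q, r), mu),
                              op_or(op_and(p, q, mu), op_and(p, r, mu)))
print(neutral_ok, commutative_ok, distributive_ok)
```

None of the densities is normalized here, consistent with the convention that only relative probabilities matter.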

The structure of an inference space, as defined, contains other useful solutions. For
instance, the theory of fuzzy sets [10] uses positive functions p(x), q(x), ... quite similar
to probability densities, but having a different interpretation: they are normed by the
condition that their maximum value equals one, and are interpreted as the "grades of
membership" of a point x to the "fuzzy sets" P, Q, ... The operations OR and AND
correspond then respectively to the union and intersection of fuzzy sets, and to the
following realization of our axioms:

$$ (p \vee q)(x) = \max(p(x), q(x)) ; \qquad (p \wedge q)(x) = \min(p(x), q(x)) , \qquad (2) $$

where the neutral element for the AND operation (intersection of fuzzy sets) is simply the
function $\mu(x) = 1$.

While fuzzy set theory is an alternative to classical probability (and is aimed at the
solution of a different class of problems), our aim here is only to complete the classical
probability theory. As explained below, the solution given by equations (1) corresponds to
the natural generalisation of two fundamental operations in classical probability theory:
that of "making histograms" and that of taking "conditional probabilities". To simplify
our language, we will sometimes use this correspondence between our theory and the
fuzzy set theory, and will say that the OR operation, when applied to two probability
distributions, corresponds to the union of the two states of information, while the AND
operation corresponds to their intersection.

It is easy to write some extra conditions that distinguish the two solutions given by
equations (1) and (2). For instance, as probability densities are normed using a multiplicative
constant (this is not the case with the grades of membership in fuzzy set theory), it makes
sense to impose the simplest possible algebra for the multiplication of probability densities
p(x), q(x), ... by constants $\lambda, \mu, \ldots$:

$$ [(\lambda + \mu)\, p](x) = (\lambda p \vee \mu p)(x) ; \qquad [\lambda (p \wedge q)](x) = (\lambda p \wedge q)(x) = (p \wedge \lambda q)(x) . \qquad (3) $$



This is different from finding a (minimal) set of axioms characterizing (uniquely) the
proposed solution, which is an open problem.

One important property of the two operations OR and AND just introduced is that of
invariance with respect to a change of variables. As we consider probability distributions
over a continuous space, and as our definitions are independent of any choice of coordinates
over the space, it must happen that we obtain equivalent results in any coordinate system.
Changing for instance from the coordinates x to some other coordinates y will change
a probability density p(x) to $p(y) = p(x)\, |\partial x / \partial y|$. It can easily be seen [11] that
performing the OR or the AND operation, then changing variables, gives the same result
as first changing variables, then performing the OR or the AND operation.
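
The invariance can be checked numerically on a simple case. The sketch below assumes a one-dimensional positive variable x, the change of variables y = log x (so that $|\partial x / \partial y| = x$), and a null information density $\mu(x) = 1/x$; these choices, as well as the helper name `to_log_variable`, are illustrative assumptions only. For contrast, the last line shows that a plain product of densities, without the division by $\mu$, does not commute with the change of variables.

```python
import numpy as np

def to_log_variable(density_x, x):
    """Density in y = log(x): f(y) = f(x) |dx/dy| = f(x) * x, evaluated at the same points."""
    return density_x * x

x = np.linspace(0.5, 20.0, 400)
mu = 1.0 / x                                               # assumed null information density
p = np.exp(-0.5 * ((np.log(x) - 1.0) / 0.4) ** 2) / x      # hypothetical densities in x
q = np.exp(-0.5 * ((np.log(x) - 1.5) / 0.6) ** 2) / x

# Route 1: AND in the x coordinates, then change of variables.
and_then_change = to_log_variable(p * q / mu, x)
# Route 2: change of variables first, then AND in the y coordinates.
p_y, q_y, mu_y = (to_log_variable(f, x) for f in (p, q, mu))
change_then_and = p_y * q_y / mu_y

invariant_and = np.allclose(and_then_change, change_then_and)
invariant_or = np.allclose(to_log_variable(p + q, x), p_y + q_y)
# A bare product p*q (no division by mu) does not transform as a density.
naive_product_invariant = np.allclose(to_log_variable(p * q, x), p_y * q_y)
print(invariant_and, invariant_or, naive_product_invariant)
```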

Let us mention that the equivalent of equations (1) for discrete probability distributions
is:

$$ (p \vee q)_i = p_i + q_i ; \qquad (p \wedge q)_i = \frac{p_i\, q_i}{\mu_i} . \qquad (4) $$

Although the OR and AND notions just introduced are consistent with classical logic,
they are here more general, as they can handle states of information that are more subtle 
than just the "possible" or "impossible" ones. 

2.3 The interpretation of the OR and the AND operation 

If an experimenter faces realizations of a random process and wants to investigate the 
probability distribution governing the process, he may start making histograms of the 
realizations. For instance, for realizations of a probability distribution over a continuous 
space, he will obtain histograms that, in some sense, will approach the probability density 
corresponding to the probability distribution. 

A histogram is typically made by dividing the working space into cells, and by counting 
how many realizations fall inside each cell. A more subtle approach is possible. First, 
we have to understand that, in the physical sciences, when we say "a random point has 
materialized in an abstract space", we may mean something like "this object, one among 
many that may exist, vibrates with some fixed period; let us measure as accurately as 
possible its period of oscillation". Any physical measurement of a real quantity will have
attached uncertainties. As explained in section 3.2, this means that when, mathematically
speaking, we measure "the coordinates of a point in an abstract space" we will not obtain
a point, but a state of information over the space, i.e., a probability distribution.

If we have measured the coordinates of many points, the results of each measurement
will be described by a probability density $p_i(x)$. The union of all these, i.e., the
probability density

$$ (p_1 \vee p_2 \vee \ldots)(x) = \sum_i p_i(x) \qquad (5) $$

is a finer estimation of the background probability density than an ordinary histogram, as
actual measurement uncertainties are used, irrespective of any division of the space into
cells. If it happens that the measurement uncertainties can be described using box-car
functions at fixed positions, then the approach we propose reduces to the conventional
making of histograms. This is illustrated in figure 1.
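
The following sketch illustrates this generalized histogram. It assumes, purely for the example, Gaussian measurement densities with individually chosen widths, synthetic realizations drawn from a normal distribution, and unit-width cells for the box-car case; none of these choices comes from the text. The box-car sum, evaluated at the cell centres, reproduces the ordinary histogram counts.

```python
import numpy as np

def gaussian(x, center, sigma):
    """Normalized Gaussian density, used here as a model of one measurement."""
    return np.exp(-0.5 * ((x - center) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def boxcar(x, left, right):
    """Box-car density: constant on [left, right), zero outside, unit integral."""
    return np.where((x >= left) & (x < right), 1.0 / (right - left), 0.0)

rng = np.random.default_rng(0)
measured = rng.normal(5.0, 1.0, size=200)      # hypothetical measured values
sigmas = rng.uniform(0.1, 0.5, size=200)       # hypothetical measurement uncertainties

# OR of all the measurement densities: a plain (unnormalized) sum.
x = np.linspace(0.0, 10.0, 1001)
dx = x[1] - x[0]
union = sum(gaussian(x, c, s) for c, s in zip(measured, sigmas))
total = union.sum() * dx                       # close to the number of measurements

# Box-car special case: each measurement only tells in which fixed cell the point fell.
edges = np.arange(0.0, 11.0, 1.0)
cell = np.digitize(measured, edges) - 1
cell = cell[(cell >= 0) & (cell < len(edges) - 1)]
centres = 0.5 * (edges[:-1] + edges[1:])
union_box = sum(boxcar(centres, edges[k], edges[k + 1]) for k in cell)
counts, _ = np.histogram(measured, bins=edges)
print(np.allclose(union_box, counts))
```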





Figure 1: Illustration of the OR operation applied to probability distributions. A his- 
togram is made (see top of the figure) by dividing the working space into cells, and by 
counting how many realizations fall inside each cell. A more subtle approach is possi- 
ble. First, we have to understand that, in the physical sciences, when we say "a random 
point has materialized in an abstract space", we may mean something like "this object, 
one among many that may exist, vibrates with some fixed period; let us measure as ac- 
curately as possible its period of oscillation". Any physical measure of a real quantity 
will have attached uncertainties. This means that when, mathematically speaking, we 
measure "the coordinates of a point in an abstract space" we will not obtain a point, but 
a state of information over the space, i.e., a probability distribution. If we have mea- 
sured the coordinates of many points, the results of each measurement will be described 
by a probability density $p_i(x)$. The union of all these, i.e., the probability density
$(p_1 \vee p_2 \vee \ldots)(x) = \sum_i p_i(x)$ is a finer estimation of the background probability density
than an ordinary histogram, as actual measurement uncertainties are used, irrespectively 
of any division of the space into cells. If it happens that the measurement uncertain- 
ties can be described using box-car functions (at fixed positions), then, the approach we 
propose reduces to the conventional making of histograms. 

Figure 2: Illustration of the AND operation applied to probability distributions. This
figure explains that our definition of the AND operation is a generalization of the notion of
conditional probability. A probability distribution $P(\cdot)$ is represented (left of the figure)
by its probability density. To any region A of the plane, it associates the probability
P(A). If a point has been realized following the probability distribution $P(\cdot)$ and we
are given the information that, in fact, the point is "somewhere" inside the region B, then
we can update the prior probability $P(\cdot)$, replacing it by the conditional probability
$P(\cdot \mid B) = P(\cdot \cap B)/P(B)$. It equals $P(\cdot)$ inside B and is zero outside (center of
the figure). If instead of the hard constraint $x \in B$ we have a soft information about the
location of x, represented by the probability distribution $Q(\cdot)$ (right of the figure), the
intersection of the two states of information P and Q gives a new state of information
(here, $\mu(x)$ is the probability density representing the state of null information, and, to
simplify the figure, has been assumed to be constant). The comparison of the right with
the center of the figure shows that the AND operation generalizes the notion of conditional
probability. In the special case where the probability density representing the second state
of information, $Q(\cdot)$, equals the null information probability density inside the domain
B and is zero outside, then the notion of intersection of states of information exactly
reduces to the notion of conditional probability.



Figure 2 explains that our definition of the AND operation is a generalization of the
notion of conditional probability. A probability distribution $P(\cdot)$ is represented, in
the figure, by its probability density. To any region A of the plane, it associates the
probability P(A). If a point has been realized following the probability distribution
$P(\cdot)$ and we are given the information that, in fact, the point is "somewhere" inside
the region B, then we can update the prior probability $P(\cdot)$, replacing it by the
conditional probability $P(\cdot \mid B) = P(\cdot \cap B)/P(B)$. It equals $P(\cdot)$ inside B
and is zero outside (center of the figure). If instead of the hard constraint $x \in B$
we have a soft information about the location of x, represented by the probability
distribution $Q(\cdot)$ (right of the figure), the intersection of the two states of information
P and Q gives a new state of information (here, $\mu(x)$ is the probability density
representing the state of null information, and, to simplify the figure, has been assumed
to be constant). The comparison of the right with the center of the figure shows that the
AND operation generalizes the notion of conditional probability. In the special case where
the probability density representing the second state of information, $Q(\cdot)$, equals the
null information probability density inside the domain B and is zero outside, then the
notion of intersection of states of information exactly reduces to the notion of conditional
probability.
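
This special case can be verified on a discretized example. The sketch below is one-dimensional for brevity (the figure uses a plane), takes a constant null information density as in the figure, and uses a hypothetical prior, region B, and soft density; all these choices are assumptions made for the illustration.

```python
import numpy as np

def op_and(p, q, mu):
    """Intersection of two states of information: p(x) q(x) / mu(x)."""
    return p * q / mu

x = np.linspace(-5.0, 5.0, 1001)
dx = x[1] - x[0]
mu = np.ones_like(x)                               # null information, assumed constant
p = np.exp(-0.5 * x ** 2)                          # hypothetical prior density (unnormalized)

in_B = (x > 0.5) & (x < 2.0)                       # hard constraint: x inside B
q_hard = np.where(in_B, mu, 0.0)                   # Q equals mu inside B, zero outside
q_soft = np.exp(-0.5 * ((x - 1.25) / 0.5) ** 2)    # soft information on the location of x

posterior_hard = op_and(p, q_hard, mu)
posterior_soft = op_and(p, q_soft, mu)

# Classical conditional density: p restricted to B, divided by its integral over B.
conditional = np.where(in_B, p, 0.0) / (p[in_B].sum() * dx)
# Once normalized, the hard-constraint intersection coincides with it.
normalized_hard = posterior_hard / (posterior_hard.sum() * dx)
same = np.allclose(normalized_hard, conditional)
print(same)
```

With the soft density `q_soft`, the result `posterior_soft` is nonzero outside B as well, which is the sense in which the intersection generalizes conditioning.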

Now the interpretation of the neutral element for the AND operation can be made
clear. We postulated that the neutral probability distribution M is such that for any
probability distribution P, $P \wedge M = P$. This means that if a point is realized according
to a probability distribution P, and if a (finite accuracy) measurement of the coordinates of the
point produces the information represented by M, the posterior probability distribution,
$P \wedge M$, is still P: the probability distribution M is not carrying any information at
all. Accordingly, we call M the null information probability distribution. Sometimes,
the probability density representing this state of null information is constant over all the
space; sometimes, it is not, as explained in section 3.1. It is worth mentioning that this
particular state of information enters into Shannon's definition of Information Content

mm- 

It is unfortunate that, when dealing with probability distributions over continuous
spaces, conditional probabilities are often misused. Note [14] describes the so-called
Borel-Kolmogorov paradox: using conditional probability densities in a space with coordinates
(x, y) will give results that will not be consistent with those obtained by the use of
conditional probability densities on the same space but where other coordinates (u, v)
are used (if the change of coordinates is nonlinear). Jaynes [15] gives an excellent, explicit,
account of the paradox. But his choice for resolving the paradox is different from ours:
while Jaynes just insists on the technical details of how some limits have to be taken
in order to ensure consistency, we radically decide to abandon the notion of conditional
probability, and replace it by the intersection of states of information (the AND operation)
which is naturally consistent under a change of variables, as demonstrated in note (pl^. 

3 Physical parameters

Crudely speaking, a physical parameter is anything that can be measured. Indeed, a physical
parameter, like a temperature, an electric field, or a mass, can only be defined by
prescribing the experimental procedure that will measure it. Cook [16] discusses this point
with lucidity.

The theory to be developed in this article will be illustrated by the analysis of objects 
that have a characteristic length, L , affected by phenomena that have a characteristic 
period, T . A measurement of a parameter is performed by realizing the conventional unit 
(i.e., the meter for a length, the second for a duration) and by comparing the parameter 
to the unit. We have then to turn to the definition of the units of time duration and of 
length. 

At present, the second is defined as the duration of 9 192 631 770 periods of the
radiation corresponding to the transition between the two hyperfine levels of the ground
state of the caesium-133 atom. Practically this means that a beam of caesium-133 atoms
is subjected to an electromagnetic field of adjustable frequency: when the imposed
frequency is such that it causes the transition between the two hyperfine levels of the
ground state of the atoms, the standard of frequency (and, thus, of period) has been
realised.

Until 1983, the unit of length used to be defined independently of that of time duration.
Now the meter is connected to the second by defining the value of the velocity of light
as $c = 299\,792\,458$ m s$^{-1}$. This means, in fact, that lengths are measured by measuring
the time it takes light to traverse them (and then converting to distance through this
conventional value of c).

3.1 The noninformative probability density for physical parameters

Once a physical parameter has been defined, it is possible to associate to it a particular 
probability distribution, that will represent, when making a measurement, the absence of 
information on the possible outcome of the experiment. 

Assume that, furnished with our definition of the unit of time duration, we wish to
measure the period of some object. It can be the period of a rotating galaxy, or the
period of a XVII-th century pendulum, or the period of a vibrating molecule: we do not
know yet. Let us denote by p(T) the probability density representing this state of total
ignorance. The frequency $\nu$ associated to the period T is $\nu = 1/T$. From p(T) we
can, using the general rule of change of variables, deduce the probability density for the
frequency: $q(\nu) = p(T)\, |dT/d\nu| = p(T)/\nu^2$.

Now, the definition of the unit of time duration is indistinguishable from the definition
of the unit of frequency. In fact, when trying to define the standard of time we said "when
the imposed frequency is such that it causes the transition between the two hyperfine
levels of the ground state of the atoms, the standard of frequency has been realised",
which shows how closely related the reciprocal parameters period-frequency are: we cannot
define the unit second without defining, at the same time, the unit Hertz.

We find here, at a very fundamental level, the class of reciprocal parameters analyzed
by Harold Jeffreys [17]. As he argued, the null information probability density must have
the same form for the two parameters, i.e., $p(\cdot)$ and $q(\cdot)$ must be the same function.
Then, the constraint $q(\nu) = p(T)/\nu^2$, seen above, gives, up to a multiplicative constant,
the solution

$$ p(T) = \frac{1}{T} ; \qquad q(\nu) = \frac{1}{\nu} . \qquad (6) $$
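
A short numerical check of this form invariance may help. The sketch below applies the change of variables $\nu = 1/T$ to a candidate density and tests whether the result has the same functional form; the helper name `change_to_frequency` and the sample grid are assumptions made for the example. A constant density is included for contrast. The last lines anticipate the logarithmic variable $T^* = \log(T/T_0)$, for which the 1/T density becomes constant.

```python
import numpy as np

def change_to_frequency(p, nu):
    """q(nu) = p(T) |dT/dnu| with T = 1/nu, i.e. q(nu) = p(1/nu) / nu**2."""
    return p(1.0 / nu) / nu ** 2

def reciprocal(t):
    """Candidate null information density, p(T) = 1/T (up to a constant)."""
    return 1.0 / t

def constant(t):
    """A constant density, used only for contrast."""
    return np.ones_like(t)

nu = np.logspace(-3, 3, 50)
keeps_form_reciprocal = np.allclose(change_to_frequency(reciprocal, nu), reciprocal(nu))
keeps_form_constant = np.allclose(change_to_frequency(constant, nu), constant(nu))

# Density of the logarithmic variable T* = log(T/T0): p*(T*) = p(T) |dT/dT*| = p(T) * T.
T = np.logspace(-3, 3, 50)
log_density = reciprocal(T) * T
print(keeps_form_reciprocal, keeps_form_constant, np.allclose(log_density, 1.0))
```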

The range of time durations (or of periods) considered in physics spans many orders
of magnitude (from periods of atomic objects to cosmological periods). Physicists then
often use a logarithmic scale, defining, for instance, $T^* = \log(T/T_0)$ and $\nu^* = \log(\nu/\nu_0)$,
where the two constants $\nu_0$ and $T_0$ can be arbitrary [18]. Transforming the probability
densities in (6) to the logarithmic variables gives $p^*(T^*) = 1$ and $q^*(\nu^*) = 1$. The
logarithmic variables (that take values on all the real line) have a constant probability density.
This, in fact, is the deep interpretation of the 1/x probability densities in equations (6).
The particular variables for which the probability density representing the state of null
information is a constant over all the space can be named Cartesian: they are more "natural"
than others, as are the usual Cartesian coordinates in Euclidean spaces [19]. That
these "Cartesian" variables are not only more natural, but also more practical than other
variables, can be understood by considering that manufacturers of pianos space notes with
constant increments not of frequency, but of the associated logarithmic variable.

We have seen that the definition of length is today related to that of time duration
through the velocity of light. We could say that the electromagnetic wave of the radiation
that defines the unit of time defines, through its wavelength, the unit of distance.
But, here again, we have a perfect symmetry between the wavelength and its inverse,
the wavenumber. This is why we take the function $h(L) = 1/L$ to describe the null
information probability density for the length of an object [20].



3.2 Measuring physical parameters 

To define the experimental procedure that will lead to a "measurement" we need to
conceptualize the objects of the "universe": do we have point particles or a continuous
medium? Any instrument that we can build will have finite accuracy, as any manufacture
is imperfect. Also, during the measurement act, the instrument will always be subjected
to unwanted disturbances (like uncontrolled vibrations).

This is why, even if the experimenter postulates the existence of a well defined, "true
value" of the measured parameter, she/he will never be able to measure it exactly. Careful
modeling of experimental uncertainties is not easy. Sometimes, the result of a measurement
of a parameter p is presented as $p = p_0 \pm \sigma$, where the interpretation of $\sigma$ may be
diverse. For instance, the experimenter may imagine a bell-shaped probability density
around $p_0$ representing her/his state of information "on the true value of the parameter".
The constant $\sigma$ can be the standard deviation (or mean deviation, or other estimator of
dispersion) of the probability density used to model the experimental uncertainty.

In part, the shape of this probability density may come from histograms of observed
or expected fluctuations. In part, it will come from a subjective estimation of the defects
of the unique pieces of the instrument. We postulate here that the result of any
measurement can, in all generality, be described by defining a probability density over
the measured parameter, representing the information brought by the experiment on the
"true", unknowable, value of the parameter. The official guidelines for expressing uncer- 
tainty in measurement, as given by the International Organization for Standardization 
(ISO) and the National Institute of Standards and Technology (^) although stressing 
the special notion of standard deviation, are consistent with the possible use of general 
probability distributions to express the result of a measurement, as advocated here. 

Not every shape of the density function is acceptable. For instance, the use of a Gaussian
density to represent the result of a measurement of a positive quantity (like an electric
resistivity) would give a nonzero probability for negative values of the variable, which is
inconsistent (a lognormal probability density, on the contrary, could be acceptable).

In the event of an "infinitely bad measurement" (as when, for instance, an unexpected
event prevents any meaningful measurement), the result of the measurement should be
described using the null information probability density introduced above. In fact, when
the density function used to represent the result of a measurement has a parameter $\sigma$
describing the "width" of the function, it is the limit of the density function for
$\sigma \to \infty$ that should represent a measurement of infinitely bad quality. This is
consistent, for instance, with the use of a lognormal probability density for a parameter
like an electric resistivity r, as the limit of the lognormal for $\sigma \to \infty$ is the
$1/r$ function, which is the right choice of noninformative probability density for r.
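
The sense of this limit can be checked numerically. The normalized lognormal density
tends to zero pointwise as its width grows, so the sketch below compares shapes rather
than values: for the $1/r$ density the ratio of the density at two points r_a and r_b is
r_b/r_a, and the lognormal ratio should approach that number. The median r0 = 3 and the
evaluation points are arbitrary assumptions of this example.

```python
import math

def lognormal_pdf(r, r0, s):
    """Lognormal density with median r0 and log-standard-deviation s."""
    return math.exp(-0.5 * (math.log(r / r0) / s) ** 2) / (math.sqrt(2.0 * math.pi) * s * r)

def shape_ratio(r_a, r_b, r0, s):
    """Ratio f(r_a) / f(r_b); for the 1/r density this ratio equals r_b / r_a."""
    return lognormal_pdf(r_a, r0, s) / lognormal_pdf(r_b, r0, s)

# The ratio should approach 10 / 1 = 10 as the width s grows.
for s in (1.0, 10.0, 100.0, 1000.0):
    print(s, shape_ratio(1.0, 10.0, r0=3.0, s=s))
```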

Another example of a possible probability density to represent the result of a
measurement of a parameter p is to take the noninformative probability density for
$p_1 < p < p_2$ and zero outside. This fixes strict bounds for the possible values of the
parameter, and tends to the noninformative probability density when the bounds tend to
infinity.

The point of view proposed here will be consistent with the use of "theoretical
parameter correlations" as proposed in section 4.4, so that there is no difference, from
our point of view, between a "simple measurement" and a measurement using physical
theories, including, perhaps, sophisticated inverse methods.

4 Bayesian physical theories 

Physical "laws" prevent us from setting some physical parameters arbitrarily. For
instance, we can set the length of a tube where a free fall experiment will be performed,
and we can also decide on the place and time of the experiment, but the time duration of
the free fall is "imposed by Nature". Physics is much about the analysis of these physical
correlations between parameters.

Typically, a set $\mathbf{i}$ of independent parameters is identified, and experiments are
performed in order to measure the values of a set $\mathbf{d}$ of dependent parameters
(|2^). Analytical physical theories then try to express the result of the observations by a
functional relationship $\mathbf{d} = \mathbf{d}(\mathbf{i})$. In fact, saying that the
independent parameters are "set" and the dependent parameters "measured" is an
oversimplification, as all the parameters must be measured. And, as discussed in the
previous section, uncertainties are present in every measurement. The values of the
parameters that are set (the independent parameters) are never known exactly. The
measurements of the dependent parameters always have uncertainties attached. Assume we
have made a large number of experiments that show how the dependent parameters correlate
with the independent ones. Within the error bars of the experimental results it will
always be possible to fit an infinity of functional relationships of the form
$\mathbf{d} = \mathbf{d}(\mathbf{i})$. Adding more experimental points may help to discard
some of the "theories", but there will always remain an infinity of them.

We formalize this fact at a fundamental level, by replacing the need for a functional
relationship by the use of a probability distribution in the space of all the parameters
considered, representing the actual information we may have. Not only does this point of
view correspond to a certain philosophy of physics, it also leads, as discussed below, to
the only consistent formalism we know that is able to predict values of possible
observations and of the attached uncertainties.

To be complete, we consider two cases where we may wish to analyze the physical
correlations between parameters. The first case is when a repetitive phenomenon takes
place spontaneously. The second case is when an experimenter prompts a physical
phenomenon, using an experimental arrangement.

4.1 The "contemplative" point of view 

Consider an astronomer trying to analyze the "relationship" between the initial
magnitude m of shooting stars and the total distance A traveled by the meteors in the sky
before disintegration. Each shooting star naturally appearing in the sky will allow one
measurement of the two parameters m and A to be performed (and possibly of other
significant parameters). As discussed above, each result of a measurement will be
represented by a probability density. Let $\theta_i(m, A)$ be the probability density
representing the information obtained on the parameters m and A of the i-th shooting star.

When a large enough number of shooting stars has been observed, the correlation
between the parameters m and A is perfectly described by the probability density
obtained by applying the OR operation (as defined by the first of equations (1)) to the
probability distributions represented by $\theta_1(m, A), \theta_2(m, A), \ldots$, i.e., by
the probability density $\theta(m, A) = \sum_i \theta_i(m, A)$. If, more generally, the
observed parameters are generically represented by $\mathbf{x}$, and the result of the i-th
experiment by the probability density $\theta_i(\mathbf{x})$, then

$$\theta(\mathbf{x}) = \sum_i \theta_i(\mathbf{x}) . \qquad (7)$$

The utility of this probability density will be explained in section 4.4.
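
A minimal sketch of the sum in equation (7), evaluated on a regular grid for a single
generic parameter x, could look as follows. The three measurement results, the Gaussian
shape of each density, the grid, and the final normalization step are assumptions of this
example (equation (7) itself is written without a normalization constant).

```python
import math

def gaussian(x, mean, std):
    """Gaussian density, used here as a stand-in for one measurement result."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def or_combine(densities, grid):
    """OR operation on a grid: add the densities point by point, then normalize."""
    dx = grid[1] - grid[0]
    total = [sum(d(x) for d in densities) for x in grid]
    norm = sum(total) * dx
    return [v / norm for v in total]

# Hypothetical results (central value, width) of three measurements of x.
measurements = [(1.0, 0.2), (1.5, 0.1), (3.0, 0.3)]
densities = [lambda x, m=m, s=s: gaussian(x, m, s) for m, s in measurements]
grid = [k * 0.01 for k in range(-200, 601)]  # x from -2 to 6
theta = or_combine(densities, grid)
```

The resulting list `theta` is multimodal: every individual state of information remains
visible in the sum, which is what distinguishes the OR operation from a product.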



4.2 The "experimental" point of view 

Here, the independent parameters $\mathbf{i}$ are "set", and the dependent parameters
$\mathbf{d}$ measured. This case can be reduced to the previous case (the "contemplative"
one) provided that the independent parameters $\mathbf{i}$ are "randomly generated"
according to some reference probability distribution, such as the null information
probability distribution discussed in section ^]T] (this guaranteeing, in particular, that
any possible region of the space of independent parameters will eventually be sampled).

As above, if $\theta_i(\mathbf{i}, \mathbf{d})$ is the probability density representing the
information on $\mathbf{i}$ and $\mathbf{d}$ obtained from the i-th experiment, after a
large enough number of experiments has been performed, the correlations between the
dependent and the independent parameters are described by the probability density
$\theta(\mathbf{i}, \mathbf{d}) = \sum_i \theta_i(\mathbf{i}, \mathbf{d})$. In general, if
the whole set of parameters is generically represented by
$\mathbf{x} = \{\mathbf{i}, \mathbf{d}\}$, and the result of the i-th experiment by the
probability density $\theta_i(\mathbf{x})$, then equation (7) holds again.

We have here assumed that the values of the independent parameters are set randomly
according to their null information probability density. This directly leads to
the "Bayesian theory" $\theta(\mathbf{i}, \mathbf{d})$ (this terminology being justified in
section 4.4). A second option consists in defining physical correlations between parameters
as a conditional probability density for the dependent parameters, given the independent
parameters, $\theta(\mathbf{d}|\mathbf{i})$, but for the reasons explained elsewhere (|T^
the notion of conditional probability density, although a valid mathematical definition, is
not of direct use for handling experimental results, unless enough care is taken. Assume,
for instance, that the space of independent parameters is divided in boxes
(multidimensional "intervals") and that the independent parameters can be set to values
that are certain to belong to one of the boxes. Performing the experiment for each of the
possible "boxes" for the independent parameters, and, correspondingly, measuring the values
of the dependent parameters $\mathbf{d}$, will produce states of information that are
crudely represented in figure 3. This collection of states of information corresponds to
the conditional probability density $\theta(\mathbf{d}|\mathbf{i})$. The joint probability
density in the $(\mathbf{i}, \mathbf{d})$ space that carries this information without
carrying any information about the independent parameters (what we wish to call the
"Bayesian theory") is then the product of the conditional probability density
$\theta(\mathbf{d}|\mathbf{i})$ by the null information probability density for the
independent parameters, say $\mu_{\mathbf{i}}(\mathbf{i})$, i.e., the probability density

$$\theta(\mathbf{i}, \mathbf{d}) = \theta(\mathbf{d}|\mathbf{i}) \, \mu_{\mathbf{i}}(\mathbf{i}) . \qquad (8)$$

To be more accurate, if, in each experiment, the only thing we know about the independent
parameters is the box where their value belongs, the measurement produces a probability
density in the $(\mathbf{i}, \mathbf{d})$ space, say $\theta_i(\mathbf{i}, \mathbf{d})$,
that equals the product of a probability density over $\mathbf{d}$ (describing the result
of the measurement of the dependent parameters) times a probability density that equals
zero outside the box and equals the null information probability density inside the box.
Applying the OR operation to all these probability densities will also give the result of
equation (8).

Whether the conditional probability density $\theta(\mathbf{d}|\mathbf{i})$ is interpreted
as simply putting some "error bars" around some "true functional relationship"
$\mathbf{d} = \mathbf{d}(\mathbf{i})$ that will always escape our knowledge, or whether the
experimental knowledge that $\theta(\mathbf{d}|\mathbf{i})$ represents is assumed to be the
"real thing", with no necessity of postulating the existence of a functional relationship,
is a metaphysical question that will not change the manner of doing physical inference. As
explained in section 4.4, inference will combine this "theoretical knowledge", represented
by $\theta(\mathbf{i}, \mathbf{d})$, with further experiments using the AND operation.

4.3 An example of Bayesian theory 

The discussion on the noninformative priors, in section pj.l|, was made without reference
to a particular kind of object to be investigated. Let us now turn to analyze the physics
of the fall of objects at the surface of the Earth.

Figure 3: Dividing the space of independent parameters in boxes, setting the independent
parameters to values that are certain to belong to one of the boxes, performing the
experiment for each of the possible "boxes" for the independent parameters, and,
correspondingly, measuring the values of the dependent parameters $\mathbf{d}$ will produce
states of information that are crudely represented in this figure. See text for an
explanation.

Assume we have a tube (with a vacuum inside) of length L and we want to analyze the
time T it takes for a body to fall from the top to the bottom of the tube. Experiments
readily show that

$$L - \tfrac{1}{2} g T^2 \approx 0 , \qquad (9)$$

where g is the acceleration of gravity at the given location, but this "law" cannot be
exact, for many reasons: i) residual air resistance; ii) variation of gravity with height;
iii) relativistic effects; iv) intrinsic (and so far unexplored) limitations of General
Relativity; etc.

We want to replace the line $L = \tfrac{1}{2} g T^2$ by a probability density representing
the actual knowledge that can be obtained from experiments. As explained in the previous
section, the finite accuracy of any measurement will prevent the probability density from
collapsing into a line "without thickness".

Let us face the actual problem of obtaining the probability density representing the
theoretical/experimental knowledge on the physics of a falling body. In the case where
the length L is first set, and then the time T of the fall of the body measured (this
is, for instance, the way absolute gravimeters work, deducing, from the time T, the
local value of the acceleration of gravity g; we will later face the alternative
possibility), the experimenter should receive tubes of different lengths
$L_1, L_2, \ldots$ randomly generated according to the null information probability density
for the length of an object, i.e., with the probability density $1/L$.

When the first tube is provided to him, the experimenter should perform the falling
experiment and, using the best possible equipment, measure as accurately as possible the
length L of the tube given to him and the time T it takes the falling body to cover
the distance. This would provide him with a probability density $\theta_1(L, T)$
representing his knowledge of the realized value of the parameters. There is no reason for
the uncertainties on L and T, as described by this probability density, to be independent.
When a second tube, with random length, is provided to him, he should perform the
experiment again and obtain a second probability density $\theta_2(L, T)$. As already
explained, the "Bayesian theory" corresponding to these experiments is then the union (in
the sense defined above) of all the states of information obtained in all the individual
experiments, when their number tends to infinity:

$$\theta(L, T) = \sum_{i=1}^{\infty} \theta_i(L, T) . \qquad (10)$$

Figure 4 schematizes the sort of probability density such a method would produce (23).
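
The construction of equation (10) can be mimicked with a short simulation, given here only
as a sketch under stated assumptions. "Nature" is simulated by the law
$T = \sqrt{2L/g}$; the lengths are drawn from the $1/L$ density truncated to the grid
range; each measurement is represented by a product of lognormal densities in L and T with
a constant relative error of ten percent, centered on the simulated values. The
independence of the two uncertainties, the centering, the grids, and all function names
are simplifications chosen for this example, not prescriptions of the method.

```python
import math
import random

G = 9.81  # m/s^2; used only to simulate "Nature" in this sketch

def log_grid(lo, hi, n):
    """Grid of n points, regularly spaced in logarithm between lo and hi."""
    step = math.log(hi / lo) / (n - 1)
    return [lo * math.exp(k * step) for k in range(n)]

def lognormal_pdf(x, median, s):
    return math.exp(-0.5 * (math.log(x / median) / s) ** 2) / (math.sqrt(2.0 * math.pi) * s * x)

def bayesian_theory(n_experiments, L_grid, T_grid, rel_err=0.1, seed=0):
    """Accumulate theta(L, T) as the sum of the individual states of information."""
    rng = random.Random(seed)
    L_min, L_max = L_grid[0], L_grid[-1]
    theta = [[0.0] * len(T_grid) for _ in L_grid]
    for _ in range(n_experiments):
        # Tube length drawn from the 1/L density truncated to [L_min, L_max].
        L_true = L_min * (L_max / L_min) ** rng.random()
        T_true = math.sqrt(2.0 * L_true / G)
        # One measurement: independent lognormal densities on L and T (an assumption).
        f_L = [lognormal_pdf(L, L_true, rel_err) for L in L_grid]
        f_T = [lognormal_pdf(T, T_true, rel_err) for T in T_grid]
        for j, a in enumerate(f_L):
            row = theta[j]
            for k, b in enumerate(f_T):
                row[k] += a * b  # OR operation: add the new state of information
    return theta

L_grid = log_grid(0.1, 10.0, 40)   # lengths in metres
T_grid = log_grid(0.1, 2.0, 40)    # durations in seconds
theta = bayesian_theory(200, L_grid, T_grid)
```

The array `theta` is concentrated along a band of finite thickness around the line
$L = \tfrac{1}{2} g T^2$, the thickness being set by the assumed measurement errors.
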
We have explored the case where the length L of the tube is first set and, then,
the falling experiment is performed, measuring the time T. The alternative is to fix
the time duration T first and, then, to perform the falling experiment, measuring
the length L the falling body has traveled in that time. The two sorts of experiments
are not identical, as the type of measurements performed will be different and will lead
to different uncertainties. In this case, the experimenter is provided with time durations
$T_1, T_2, \ldots$ randomly selected according to the null information probability density
for the period of a process, and obtains probability densities
$\vartheta_1(L, T), \vartheta_2(L, T), \ldots$ representing the results of the
measurements. The union of all these states of information,

$$\vartheta(L, T) = \sum_{i=1}^{\infty} \vartheta_i(L, T) , \qquad (11)$$

would provide the "Bayesian theory" corresponding to that sort of experiment.

There is no reason for the two "Bayesian theories" thus obtained to be identical, as they
correspond to different types of experiment. We are then faced with the conclusion that
the replacement of an analytical equation by a probability density will lead to probability
densities attached to the precise experiment being performed. In fact, this is not so
different from what would have been obtained when seeking a functional relationship,
as the "best fitting curve" for the first kind of experiment may not be the "best fitting"
one for the second kind of experiment.

The formation of a "Bayesian theory", here made by summing small distributions
("histogramming"), can be understood in two ways. First, we could perfectly well proceed in
this way in practice, performing systematic measurements of parameter correlations, using
the best available equipment. Alternatively, we can understand the proposed method
as a thought experiment helping to clarify what "theoretical uncertainties" can be. These
uncertainties can then be modeled using standard distributions (Gaussian, double
exponential...) in such a way that usable but still realistic probability distributions in
the parameter space can be defined and used as "Bayesian theories", as in the example shown
in note (|2|). 

Figure 4: The free fall of an object inside a tube of length L takes some time duration T.
Experiments show that there is a good correlation between L and T: to a good
approximation, $L - \tfrac{1}{2} g T^2 \approx 0$. An analytical expression like
$L = \tfrac{1}{2} g T^2$ cannot be exact (any analytical theory is just an approximation of
reality). An examination of the real experiments made to obtain the "theory" shows the
presence of incompressible uncertainties. Using the approach developed here, the existing
correlations between L and T are represented by a probability density which replaces the
classical notion of analytic theory. If, at some scale, these correlations may seem well
described by an analytical expression (here, in the top figure, by a line), successive
magnifications (middle and bottom) end up showing the actual size of the "theoretical
uncertainties". In this example we have assumed that measurements of lengths and of time
durations have constant relative errors (grossly exaggerated in this schematic drawing).
The thickness of this theoretical distribution is of importance for: i) solving physical
inference problems in a mathematically consistent manner, and ii) accurately computing
uncertainties between physical parameters, as, for instance, when predicting data values or
when solving inverse problems.

We will conclude this section with two remarks. First, it is not possible to sample
a probability distribution that cannot be normalized, as is usually the case for the
noninformative probabilities, like, for positive x, the $1/x$ distribution. Then, practical
lower and upper bounds have to be used. Second, the number of experimental "points"
that have to be used in order to have a good practical approximation of a "Bayesian
theory" depends on the accuracy of the measurements. Enough experiments have to
be done so that the sum in equations (10) and (11) is smooth enough. The sharper the
experimental design, the more experiments we will need (and the more detail we will
have).
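
Regarding the first remark, once practical bounds are chosen the $1/x$ density becomes
normalizable and can be sampled by inverting its cumulative distribution function. The
sketch below assumes hypothetical bounds x_min and x_max; the function name is ours.

```python
import math
import random

def sample_reciprocal(x_min, x_max, rng=random):
    """Draw x from the density proportional to 1/x on [x_min, x_max].

    The cumulative distribution is ln(x / x_min) / ln(x_max / x_min), so a uniform
    number u in [0, 1) maps to x = x_min * (x_max / x_min) ** u.
    """
    u = rng.random()
    return x_min * (x_max / x_min) ** u

# Hypothetical bounds: tube lengths between 1 cm and 100 m.
rng = random.Random(1)
samples = [sample_reciprocal(0.01, 100.0, rng) for _ in range(20000)]
```

Equivalently, the logarithm of x is uniformly distributed between the logarithms of the
two bounds, so each decade receives the same number of samples on average.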

4.4 Using a Bayesian physical theory 

Assume that enough experiments have been made, by skilled people, using the best
available equipment, following the guidelines of the previous section, so that the
"Bayesian theory" $\theta(L, T)$ is available. Now a new tube is given to us, whose length
has been randomly generated according to the null information probability density, $1/L$.
We perform the falling experiment, perhaps with more modest equipment than that used to
obtain the Bayesian theory, and measure the two parameters L and T, the result of the
measurement being described by the state of information $\rho(L, T)$. How can we combine
this information with the Bayesian theory, so that we can improve our knowledge of L
and T? We are here exactly in the situation where the notion of conditional probability
(in fact, our generalization of it) applies: we know that we have a realization in the
$(L, T)$ space generated according to the theoretical probability density $\theta(L, T)$
and we have a state of information on this particular realization that is described by the
probability density $\rho(L, T)$. The resulting state of information is then that obtained
by applying the AND operation to these two states of information (i.e., in the language
defined above, by taking their intersection). This gives

$$\sigma(L, T) = \frac{\theta(L, T) \, \rho(L, T)}{\mu(L, T)} . \qquad (12)$$

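In general, if $\mathbf{i}$ is the independent parameter set and $\mathbf{d}$ the dependent
set,

$$\sigma(\mathbf{i}, \mathbf{d}) = \frac{\theta(\mathbf{i}, \mathbf{d}) \, \rho(\mathbf{i}, \mathbf{d})}{\mu(\mathbf{i}, \mathbf{d})} . \qquad (13)$$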

If the information content concerning L contained in $\rho(L, T)$ is very high (the
length of the tube is well known) while the information on T is low, then $\sigma(L, T)$
will essentially improve our information on T. This corresponds to the solution of a
classical prediction problem in physics (how long will it take for a stone to fall from the
top of the tower of Pisa?). Reciprocally, if the information content concerning T contained
in $\rho(L, T)$ is very high (the time of the fall is well known) while the information on
the length of the tube is low, then $\sigma(L, T)$ will essentially improve our information
on L. Then, the same equation corresponds to the solution of an "inverse problem", where
"data" is used to infer the values of the parameters describing some system. This use of
the notion of intersection of states of information to solve inverse problems was advocated
by Tarantola and Valette (§) and Tarantola (^), who showed that this method leads to
results consistent with more particular techniques (like least squares or least absolute
values) when some of the subtleties are ignored (theoretical uncertainties neglected, etc.).
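
The prediction problem just described can be sketched numerically by evaluating the AND
operation, $\sigma = \theta \rho / \mu$, on a grid. Everything specific below is an
assumption made for illustration: the stand-in theory (a lognormal band of three percent
around $T = \sqrt{2L/g}$, multiplied by $1/L$ as in equation (8)), the joint null
information density taken as $1/(LT)$, the measurement (L known to two percent, T almost
unknown), and the grid ranges.

```python
import math

G = 9.81  # m/s^2, assumed nominal value used to build a stand-in theory

def log_grid(lo, hi, n):
    """Grid of n points regularly spaced in logarithm, and its logarithmic step."""
    step = math.log(hi / lo) / (n - 1)
    return [lo * math.exp(k * step) for k in range(n)], step

def lognormal_pdf(x, median, s):
    return math.exp(-0.5 * (math.log(x / median) / s) ** 2) / (math.sqrt(2.0 * math.pi) * s * x)

def mu(L, T):
    """Null information density for two positive parameters (assumed form 1/(L*T))."""
    return 1.0 / (L * T)

def theta(L, T, s_theory=0.03):
    """Stand-in Bayesian theory: theta(T|L) times the null information density 1/L."""
    return lognormal_pdf(T, math.sqrt(2.0 * L / G), s_theory) / L

def rho(L, T):
    """Hypothetical measurement: L well known (2 %), T poorly known (50 %)."""
    return lognormal_pdf(L, 2.0, 0.02) * lognormal_pdf(T, 1.0, 0.5)

def and_combine(L_grid, T_grid):
    """AND operation on the grid: sigma = theta * rho / mu (left unnormalized)."""
    return [[theta(L, T) * rho(L, T) / mu(L, T) for T in T_grid] for L in L_grid]

def mean_T(sigma, L_grid, T_grid, dlogL, dlogT):
    """Mean of T under sigma; on a logarithmic grid the volume element is L*T*dlogL*dlogT."""
    num = den = 0.0
    for L, row in zip(L_grid, sigma):
        for T, value in zip(T_grid, row):
            w = value * L * T * dlogL * dlogT
            num += w * T
            den += w
    return num / den

L_grid, dlogL = log_grid(1.5, 2.7, 200)
T_grid, dlogT = log_grid(0.3, 1.3, 200)
sigma = and_combine(L_grid, T_grid)
print(mean_T(sigma, L_grid, T_grid, dlogL, dlogT), math.sqrt(2.0 * 2.0 / G))
```

Although the measurement alone placed T around one second with a large uncertainty, the
combined state of information concentrates T near the value suggested by the theory for a
tube of about two metres. Exchanging the widths of the two measured densities would turn
the same computation into the inverse problem of estimating L from T.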

We do not know of any alternative to our approach that consistently solves nonlinear
inverse problems.

5 Discussion and Conclusion

Introducing Kolmogorov's definition of probability distributions without introducing the
two operations OR and AND is like introducing the real numbers without introducing the
sum and the product: we may compute, replacing clear mathematical objects by intuitive
operations, but we are lacking an important structure of the space. The two operations we
have introduced satisfy axioms so obvious that it is difficult to imagine a simpler
structure.

This structure may be used for many different inference problems, but we have chosen
here an illustration in the realm of physics. We have replaced the notion of an analytical
theory by the Bayesian notion of a probability density representing all the experimentally
obtained correlations between physical parameters, the space of independent parameters
being visited randomly according to their null information probability density. In
practice, some regions of the parameter space will not be accessible to investigation.
Accordingly, the "result" of the measurement will be the null information probability
density for the corresponding parameters. In other words, the "error bars" of a "Bayesian
theory" may be large (or even infinite) for some regions of the parameter space. This is
the typical domain where classical, analytical theories extrapolate the equations that fit
the observations made in a restricted region of the parameter space. No such extrapolation
is allowed with our approach.

Although we have only shown a simple example (the Galilean experiment), the
methodology has a large domain of application. As a further example, concerning tensor
quantities, we could examine the dependence between stress and strain for a given medium.
This would involve: i) mathematical definition of strain from displacement; ii) operational
definition of stress; and iii) analysis of the stress-strain correlation using the method
described in this article.

Analytical theories, when extrapolating, predict results that may not correspond to
observations, when they are made. The theory is then "falsified" in the sense of Popper,
and has to be corrected. A "Bayesian theory" can be indefinitely refined, as larger
domains of the parameter space become accessible to experimentation, but never falsified.
The present work shows that pure empiricism (as opposed to the mathematical rationalism
of analytical theories) can be mathematically formalised. This formalism is the only one
known by the authors that handles uncertainties consistently.

If physicists enjoy the game of extrapolation (as, for instance, when pushing Einstein's 
gravity theory to the conditions prevailing in a Big Bang model of the Universe), engineers 
advance by performing experiments as close as possible to the conditions that will prevail 
"in the real thing" . 

Using the approach proposed here, the "=" sign is only used for mathematical
definitions, as, for instance, when defining a frequency from a period, $\nu = 1/T$, or
when using the mathematics associated with probability calculus. But the "=" sign is never
used to describe physical correlations, which are, by nature, only approximate. These
physical correlations are described by probability distributions. Some may see the
systematic use of the "=" sign in mathematical physics as a misuse of mathematical
concepts.